GH-45601: [R] R arrow cannot handle labelled data in arrow tables #46431

Draft · wants to merge 2 commits into main

Conversation

@thisisnic (Member) commented May 13, 2025

Rationale for this change

There is a bug where R crashes when working with labelled columns in an Arrow Table.

What changes are included in this PR?

Remove labels from columns (see the sketch below)

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

Draft PR - this works for tables but not datasets yet
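
This is not necessarily how the PR implements it, but as a minimal sketch of the general idea (stripping haven labels so Arrow only sees plain vectors), using haven::zap_labels():

library(haven)
library(arrow)
library(tibble)

d <- tibble(
  a = labelled(1:5, labels = c(low = 1L, high = 5L)),
  b = labelled(11:15)
)

# zap_labels() drops the value labels and the haven_labelled class,
# leaving plain integer vectors that Arrow handles natively.
d_plain <- zap_labels(d)

tbl <- arrow_table(d_plain)
tbl$schema   # both columns are now plain int32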

⚠️ GitHub issue #45601 has been automatically assigned in GitHub to PR creator.

@thisisnic (Member, Author) commented:

@amoeba I tried the approach you suggested here, but because we use as_arrow_table() internally in a lot more functions, we end up breaking roundtripping with Feather etc.

I think if we work only in R, we would want to remove the labels and then restore them later, but I'm still trying to find an uncomplicated way of doing this.

I think we definitely want to stop the segfault regardless and error instead.

Users can technically use mutate() to cast the column to a type we can work with, but there will be resource costs to doing this on a dataset. See my reprex below.

library(haven)
library(arrow)
library(tibble)
library(dplyr)

d <- tibble(
  a = labelled(x = 1:5),
  b = labelled(x = 11:15)
)

tf <- tempfile()
write_parquet(d, tf)

# still fails
read_parquet(tf, as_data_frame = FALSE) %>%
  filter(a > 3) %>%
  collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! NotImplemented: Function 'greater' has no kernel matching input types (<labelled<integer>[0]>, <labelled<integer>[0]>)
tf <- tempfile()
write_parquet(d, tf)

# works
read_parquet(tf, as_data_frame = FALSE) %>%
  mutate(a = as.integer(a)) %>%
  filter(a > 3) %>%
  collect()
#> # A tibble: 2 × 2
#>       a b        
#>   <int> <int+lbl>
#> 1     4 14       
#> 2     5 15
# fails
open_dataset(tf) %>%
  mutate(a = as.integer(a)) %>%
  filter(a > 3) %>%
  collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! NotImplemented: Function 'greater_equal' has no kernel matching input types (<labelled<integer>[0]>, <labelled<integer>[0]>)
# works but potentially higher resource usage
open_dataset(tf) %>%
  mutate(a = as.integer(a)) %>%
  compute() %>%
  filter(a > 3) %>%
  collect()
#> # A tibble: 2 × 2
#>       a b        
#>   <int> <int+lbl>
#> 1     4 14       
#> 2     5 15
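
For reference, a rough sketch of the "remove, then restore" idea (not what this PR does): capture the haven attributes, drop the labels for the Arrow round trip, then reattach them after collecting back into R.

library(haven)
library(arrow)
library(dplyr)
library(tibble)

d <- tibble(
  a = labelled(1:5, labels = c(low = 1L, high = 5L)),
  b = labelled(11:15)
)

# Remember the label-related attributes of each column.
saved_attrs <- lapply(d, attributes)

# Strip the labels so Arrow sees plain integer columns.
tf <- tempfile()
write_parquet(zap_labels(d), tf)

res <- read_parquet(tf, as_data_frame = FALSE) %>%
  filter(a > 3) %>%
  collect()

# Reattach the saved attributes to the surviving columns.
for (nm in intersect(names(res), names(saved_attrs))) {
  attributes(res[[nm]]) <- saved_attrs[[nm]]
}
res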

@thisisnic (Member, Author) commented:

I've stopped it segfaulting on printing, but I think the actual fix needs to go more layers deep.

@thisisnic (Member, Author) commented:

I'm also wondering whether, instead of supporting this, we should just stop the segfault, error appropriately, and recommend folks do something like:

open_dataset(whatever) %>%
  mutate(col = cast(col, int32())) %>%
  write_dataset(newlocation)

open_dataset(newlocation) %>%
  filter(col > 3) %>%
  collect()

Otherwise we're getting into the territory of supporting compute functions on extension types, which we don't currently do, and which, if implemented, should be done lower down the stack anyway.

More discussion on computing on extension types here: https://lists.apache.org/thread/2j61nrod7x0s5vjhc6q9tlj898drz7rn
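
For the guard itself, something like the following hypothetical helper (check_labelled_cols() is illustrative, not an existing arrow function) could run before conversion so we error cleanly instead of segfaulting:

# Hypothetical pre-conversion check; not part of arrow.
check_labelled_cols <- function(df) {
  bad <- names(df)[vapply(df, inherits, logical(1), what = "haven_labelled")]
  if (length(bad) > 0) {
    stop(
      "Column(s) ", paste(bad, collapse = ", "),
      " are haven_labelled, which Arrow compute does not support. ",
      "Cast them first, e.g. mutate(col = as.integer(col)), ",
      "or strip labels with haven::zap_labels().",
      call. = FALSE
    )
  }
  invisible(df)
}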
